*******************************;
* Contents of dofiles          ;
*******************************;

Title: The Consequences of Industrialization: Evidence from Water Pollution and Digestive Cancers in China
Date : June 2010

*******************
Data Preparation
*******************

1. water_pollution.do. Uses Pollutants_Counties_watersheds.dbf, stored
in GIS/water_pollution. This creates the basin-level and province-level
water quality measures.
2. air_pollution.do. Uses longterm_particulatesbybasin.csv from the
GIS/air_pollution directory to create basin-level and province-level
air quality measures.
3. rainfall.do. This brings in the precipitation data at the basin-level
from the rainfall GIS directory and also produces province-level
precipitation measures.
4. census_data.do. This creates ~/pollution/datafiles/census_data.dta
from the original county-level census tabulations. There are 2,873
observations. This is currently not used in the analysis.
5. waterpoints.do. This program takes the .csv files of the water
points from the GIS directory, assigns them a river basin, and assigns
them stream information. This creates a cleaned version of the World
Bank data, and waterpoints_data.dta.
6. riversystem_info.do. This program takes .csv files from the GIS folder
and saves them as Stata files. It also merges with waterpoints_data.dta.
The key here is it creates riversystem_info.dta, which is a county-level
data set for 2000 where we observe the closest water point and the basin
upstream of the county.
7. diet_data.do. Creates diet_data.dta from the CHNS.
8. smoking_data.do. Creates smoking_data.dta from the CHIS.

**************Dumping/Water Quality analysis***************

9. dumping.do. This creates dumping1990to2006.dta, which holds province X
year dumping records. It synthesizes the dumping, levy, and output data
by province and year (see regressions in levyregs.do).
10. county_dumping.do. This uses Jing's output data by industry
in combination with dumping data by industry. Then, using a mapping
from the output locations to the basins in GIS/county_output, it
creates county-level records of the total dumping by chemical. This is
where Appendix Table 5 is made. It prepares county_dumping.dta, which
is used by wqregs.do. This data is for 2003.
11. county_output.do. It creates output by industry tabulations by
river basin.

**************Historical Output***************

12. industries.do. Creates indtemp2000.dta, which is 1970-2005 county
production using the 2000 census distribution of industrial employment
and provincial annual data for 1970-2005. This can be tweaked to use the
1990 distribution of employment at the county level (output_level_`year').
13. industry_basins_2000.do. Creates industry_basins_2000.dta, which is
ready for basin_regs.do.

******************
Data Combining
******************

14. dsp_basins.do. This creates dsp_basins.dta, the main data set of
145 sites with all sorts of info assigned to them. Run the programs
listed above in that order before running this one. This program is
really key - you could recreate most of the paper by downloading
dsp_basins.dta.
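
The run order described for dsp_basins.do could be scripted with a small
master do-file. The following is only a sketch under stated assumptions:
the file name master.do is hypothetical, all do-files are assumed to sit
in the current working directory, and dsp_basins.dta is assumed to be
saved there as well. The actual programs may set their own paths.

```stata
* master.do (hypothetical name): run the data-preparation programs in
* the order listed in this file, then build dsp_basins.dta.
* Assumption: all do-files sit in the current working directory.
clear all
set more off

* Programs 1-13, in the order they are listed above.
local prep water_pollution air_pollution rainfall census_data ///
    waterpoints riversystem_info diet_data smoking_data ///
    dumping county_dumping county_output ///
    industries industry_basins_2000

foreach f of local prep {
    display as text "Running `f'.do"
    do "`f'.do"
}

* Program 14: combine everything into the main data set.
do "dsp_basins.do"

* Assumption: dsp_basins.do saves dsp_basins.dta in the working directory.
use "dsp_basins.dta", clear
count
```

If dsp_basins.dta holds one observation per site, the final count should
report 145. A different figure may simply mean the file is organized
differently (for example, several records per site).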

*******************
Basic Regressions
*******************

15. olsregs.do. OLS specification of digestive cancers on water grade. Table 3.
16. sexregs.do. OLS by type of digestive cancer, sex, and chemical. Table 4 - columns 1/2.
17. tapregs.do. OLS regs by tap water - Table 4 columns 3/4.
18. regs_cd.do. The regressions of cancer rates on all causes. Table 6.
19. regs_cd_rf.do. OLS reduced form of rain on death rates by tap water
share. Table 7.
20. tributaryregs.do. 2SLS using rainfall + distance to headwaters. Table 8.
21. wqregs.do. This creates the data and executes the regression for
Table 9 (column 1). I combine measures of dumping, precipitation,
and water quality to estimate the impact of dumping on water quality.
22. levyregs.do. This is Table 9 (columns 2 & 3), the relationship between
dumping, cleanup, and levies.
23. olsregs_step.do. OLS specification of digestive cancers on water
grade with step function. Appendix Table 3.
24. basin_regs.do. Using industry_basins_2000, I show that 1970-1990
basin production is correlated with the 1990s cancer rate. Appendix
Table 6.
25. olsregs_unweighted.do. Unweighted regressions. Appendix Table 7.

*********
Tables
*********

26. tables.do. This creates Tables 1/2 of summary stats, the table 
with smoking and diet info (Table 5), the appendix table of DSP point 
summary statistics (Appendix Table 1), the appendix table of river
systems (Appendix Table 2), the clean/dirty river analysis and t-test, 
and the overall cancer rates for comparison with the US (Appendix
Table 4).

*********
* END
*********